Loan Default Prediction¶
Problem Definition¶
A major proportion of retail bank profit comes from interest on home loans, which are borrowed by regular-income and high-earning customers. Banks are most fearful of defaulters, as bad loans (NPAs) usually eat up a major chunk of their profits. Therefore, it is important for banks to be judicious when approving loans for their customer base. The approval process is multifaceted: through it, the bank tries to check the creditworthiness of the applicant via a manual study of various aspects of the application. This process is not only effort-intensive but also prone to wrong judgments owing to human error and bias. Many banks have attempted to automate the process using heuristics, but with the advent of data science and machine learning, the focus has shifted to building machines that can learn this approval process, making it more efficient and free of bias. At the same time, it is important to ensure that the machine does not learn the biases that previously crept in through the human approval process.
The Context:¶
A bank's consumer credit department aims to simplify the decision-making process for approving home equity lines of credit. To do this, they will follow the Equal Credit Opportunity Act's guidelines to establish an empirically derived and statistically sound credit scoring model. The model will be based on data obtained via the existing loan underwriting process from recent applicants who have been granted credit. It will be built with predictive modeling techniques, but it must be interpretable enough to provide a justification for any adverse decision (rejection).
The objective:¶
Build a classification model to predict clients who are likely to default on their loan and give recommendations to the bank on the important features to consider while approving a loan.
The key questions:¶
- What are the borrower characteristics that lead to default?
The problem formulation:¶
- Predict the likelihood of a loan going bad, and identify the key characteristics or profile of such loans.
Data Description:¶
The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant ultimately defaulted or was severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). Twelve input variables were recorded for each applicant.
BAD: 1 = Client defaulted on loan, 0 = loan repaid
LOAN: Amount of loan approved.
MORTDUE: Amount due on the existing mortgage.
VALUE: Current value of the property.
REASON: Reason for the loan request. (HomeImp = home improvement, DebtCon= debt consolidation which means taking out a new loan to pay off other liabilities and consumer debts)
JOB: The type of job the loan applicant has, such as manager, self-employed, etc.
YOJ: Years at present job.
DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).
DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
CLAGE: Age of the oldest credit line in months.
NINQ: Number of recent credit inquiries.
CLNO: Number of existing credit lines.
DEBTINC: Debt-to-income ratio (all your monthly debt payments divided by your gross monthly income; one way lenders measure your ability to manage the monthly payments to repay the money you plan to borrow).
Import the necessary libraries and Data¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score,precision_score,recall_score,f1_score,roc_curve, auc
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
import scipy.stats as stats
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
Data Overview¶
- Reading the dataset
- Understanding the shape of the dataset
- Checking the data types
- Checking for missing values
- Checking for duplicated values
Reading the dataset
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
hm=pd.read_csv("/content/drive/My Drive/GL_MIT_ADSP/Week9_Capstone/data/hmeq.csv")
data = hm.copy()
Data checks
hm.head()
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |
# check the number of rows and columns
hm.shape
(5960, 13)
# check the data types and missings
hm.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5960 entries, 0 to 5959 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 BAD 5960 non-null int64 1 LOAN 5960 non-null int64 2 MORTDUE 5442 non-null float64 3 VALUE 5848 non-null float64 4 REASON 5708 non-null object 5 JOB 5681 non-null object 6 YOJ 5445 non-null float64 7 DEROG 5252 non-null float64 8 DELINQ 5380 non-null float64 9 CLAGE 5652 non-null float64 10 NINQ 5450 non-null float64 11 CLNO 5738 non-null float64 12 DEBTINC 4693 non-null float64 dtypes: float64(9), int64(2), object(2) memory usage: 605.4+ KB
hm.isna().sum()/hm.shape[0]*100
| 0 | |
|---|---|
| BAD | 0.000000 |
| LOAN | 0.000000 |
| MORTDUE | 8.691275 |
| VALUE | 1.879195 |
| REASON | 4.228188 |
| JOB | 4.681208 |
| YOJ | 8.640940 |
| DEROG | 11.879195 |
| DELINQ | 9.731544 |
| CLAGE | 5.167785 |
| NINQ | 8.557047 |
| CLNO | 3.724832 |
| DEBTINC | 21.258389 |
hm.duplicated().sum()
np.int64(0)
hm.nunique()
| 0 | |
|---|---|
| BAD | 2 |
| LOAN | 540 |
| MORTDUE | 5053 |
| VALUE | 5381 |
| REASON | 2 |
| JOB | 6 |
| YOJ | 99 |
| DEROG | 11 |
| DELINQ | 14 |
| CLAGE | 5314 |
| NINQ | 16 |
| CLNO | 62 |
| DEBTINC | 4693 |
# creating list of numeric and categorical variables
num_cols = hm.select_dtypes(include=np.number).columns.tolist()
cat_cols = hm.select_dtypes(include='object').columns.tolist()
# treating 'BAD' (the target) as categorical rather than numeric
num_cols.remove('BAD')
cat_cols.append('BAD')
print(num_cols)
print(cat_cols)
['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC'] ['REASON', 'JOB', 'BAD']
Insights
- Dataset has 5,960 rows and 13 columns
- 11 variables are numeric and 2 variables are non-numeric (object).
- There are no duplicates in the dataset.
- Except for 'BAD' and 'LOAN', all remaining variables have missing values: 'DEBTINC' has 21% and 'DEROG' has 12% missing; the other 9 variables each have less than 10% missing.
Summary Statistics¶
hm.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BAD | 5960.0 | NaN | NaN | NaN | 0.199497 | 0.399656 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| LOAN | 5960.0 | NaN | NaN | NaN | 18607.969799 | 11207.480417 | 1100.0 | 11100.0 | 16300.0 | 23300.0 | 89900.0 |
| MORTDUE | 5442.0 | NaN | NaN | NaN | 73760.8172 | 44457.609458 | 2063.0 | 46276.0 | 65019.0 | 91488.0 | 399550.0 |
| VALUE | 5848.0 | NaN | NaN | NaN | 101776.048741 | 57385.775334 | 8000.0 | 66075.5 | 89235.5 | 119824.25 | 855909.0 |
| REASON | 5708 | 2 | DebtCon | 3928 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| JOB | 5681 | 6 | Other | 2388 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| YOJ | 5445.0 | NaN | NaN | NaN | 8.922268 | 7.573982 | 0.0 | 3.0 | 7.0 | 13.0 | 41.0 |
| DEROG | 5252.0 | NaN | NaN | NaN | 0.25457 | 0.846047 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 |
| DELINQ | 5380.0 | NaN | NaN | NaN | 0.449442 | 1.127266 | 0.0 | 0.0 | 0.0 | 0.0 | 15.0 |
| CLAGE | 5652.0 | NaN | NaN | NaN | 179.766275 | 85.810092 | 0.0 | 115.116702 | 173.466667 | 231.562278 | 1168.233561 |
| NINQ | 5450.0 | NaN | NaN | NaN | 1.186055 | 1.728675 | 0.0 | 0.0 | 1.0 | 2.0 | 17.0 |
| CLNO | 5738.0 | NaN | NaN | NaN | 21.296096 | 10.138933 | 0.0 | 15.0 | 20.0 | 26.0 | 71.0 |
| DEBTINC | 4693.0 | NaN | NaN | NaN | 33.779915 | 8.601746 | 0.524499 | 29.140031 | 34.818262 | 39.003141 | 203.312149 |
for col in cat_cols:
    print(col)
    print(hm[col].value_counts())
    print(hm[col].value_counts(normalize=True))
    print('-----' * 10)
REASON REASON DebtCon 3928 HomeImp 1780 Name: count, dtype: int64 REASON DebtCon 0.688157 HomeImp 0.311843 Name: proportion, dtype: float64 -------------------------------------------------- JOB JOB Other 2388 ProfExe 1276 Office 948 Mgr 767 Self 193 Sales 109 Name: count, dtype: int64 JOB Other 0.420349 ProfExe 0.224608 Office 0.166872 Mgr 0.135011 Self 0.033973 Sales 0.019187 Name: proportion, dtype: float64 -------------------------------------------------- BAD BAD 0 4771 1 1189 Name: count, dtype: int64 BAD 0 0.800503 1 0.199497 Name: proportion, dtype: float64 --------------------------------------------------
Observations from Summary Statistics
- Average loan amount approved is ~18,600, with a minimum of 1,100 and a maximum of 89,900. This column does not have any missing values.
- Average amount due on the existing mortgage is ~73,760, with a minimum of ~2,000 and a maximum of 399,550.
- Average value of the current property is ~101,776.
- Fewer than 25% of loans have 1 or more major derogatory reports or 1+ delinquent credit lines (the 75th percentile of both DEROG and DELINQ is 0).
- Average number of inquiries is ~1.2; the average number of existing credit lines is ~21; the average age of the oldest credit line is ~180 months.
- Average debt-to-income ratio is ~34%, and ~75% of loans have a DTI below ~39%.
- Average borrower job tenure is ~9 years; fewer than 25% of borrowers have more than 13 years at their present job.
- ~69% of loans were taken for debt consolidation.
- 42% of applicants do not have a defined job type ('Other'). Less than 2% are in sales jobs and ~3.4% are self-employed.
All columns except 'BAD' and 'LOAN' have missing values.
Exploratory Data Analysis (EDA) and Visualization¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Leading Questions:
- What is the range of values for the loan amount variable "LOAN"?
- How does the distribution of years at present job "YOJ" vary across the dataset?
- How many unique categories are there in the REASON variable?
- What is the most common category in the JOB variable?
- Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
- Do applicants who default have a significantly different loan amount compared to those who repay their loan?
- Is there a correlation between the value of the property and the loan default rate?
- Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?
Univariate Analysis¶
# plot histogram for numeric variable to understand the distribution
hm[num_cols].hist(figsize=(20,20),layout=(4,5) ,bins=20)
plt.show()
# Box plots to identify the outliers
hm[num_cols].plot(kind='box',subplots=True,layout=(4,4),figsize=(25,25))
| 0 | |
|---|---|
| LOAN | Axes(0.125,0.712609;0.168478x0.167391) |
| MORTDUE | Axes(0.327174,0.712609;0.168478x0.167391) |
| VALUE | Axes(0.529348,0.712609;0.168478x0.167391) |
| YOJ | Axes(0.731522,0.712609;0.168478x0.167391) |
| DEROG | Axes(0.125,0.511739;0.168478x0.167391) |
| DELINQ | Axes(0.327174,0.511739;0.168478x0.167391) |
| CLAGE | Axes(0.529348,0.511739;0.168478x0.167391) |
| NINQ | Axes(0.731522,0.511739;0.168478x0.167391) |
| CLNO | Axes(0.125,0.31087;0.168478x0.167391) |
| DEBTINC | Axes(0.327174,0.31087;0.168478x0.167391) |
All the numeric variables except the age of the oldest credit line (CLAGE) and the number of credit lines (CLNO) have significantly right-skewed distributions, and the box plots show the presence of outliers in the data.
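The skewness visible in the histograms can be quantified rather than eyeballed, using `DataFrame.skew()`. A minimal sketch on synthetic columns (the column names here are illustrative; in the notebook this would be `hm[num_cols].skew()`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the numeric columns: one right-skewed,
# one roughly symmetric variable
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "right_skewed": rng.exponential(scale=2.0, size=1000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=1000),
})

# Pandas' sample skewness: ~0 for symmetric data, well above 1 for heavy right skew
skew = df.skew().sort_values(ascending=False)
print(skew)

# A common rule of thumb flags |skew| > 1 as heavily skewed
heavily_skewed = skew[skew.abs() > 1].index.tolist()
print(heavily_skewed)
```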
Bivariate Analysis¶
hm.groupby(['BAD'])[num_cols].mean()
| LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|
| BAD | ||||||||||
| 0 | 19028.107315 | 74829.249055 | 102595.921018 | 9.154941 | 0.134217 | 0.245133 | 187.002355 | 1.032749 | 21.317036 | 33.253129 |
| 1 | 16922.119428 | 69460.452973 | 98172.846227 | 8.027802 | 0.707804 | 1.229185 | 150.190183 | 1.782765 | 21.211268 | 39.387645 |
On average, loans which default:
- have a lower existing mortgage due than those which do not
- have slightly fewer years at the present job (~8 yrs vs. ~9 yrs for non-defaulters)
- have higher average numbers of derogatory reports and delinquencies than non-defaulters
- have a lower average age of the oldest credit line and a higher average number of inquiries
- have a higher average debt-to-income ratio
- have a similar number of existing tradelines to non-defaulters
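The mean differences above can also be checked for statistical significance with a two-sample test from `scipy.stats` (imported above). A hedged sketch using Welch's t-test: the samples below are synthetic, drawn to roughly match the group means and sizes in the table; in the notebook the test would run on `hm.loc[hm.BAD==1, 'LOAN']` vs `hm.loc[hm.BAD==0, 'LOAN']` after dropping NaNs:

```python
import numpy as np
from scipy import stats

# Synthetic samples matching the group means (~16,900 vs ~19,000),
# a plausible spread, and the group sizes (1,189 defaulted, 4,771 repaid)
rng = np.random.default_rng(1)
loan_default = rng.normal(16900, 11000, 1189)
loan_repaid = rng.normal(19000, 11000, 4771)

# Welch's t-test (equal_var=False) does not assume equal group variances
t_stat, p_value = stats.ttest_ind(loan_default, loan_repaid, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2g}")
```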
# Categorical variables vs. 'BAD'
for i in cat_cols[0:2]:
    print(pd.crosstab(hm[i], hm['BAD'], normalize='index') * 100)
    mv = (pd.crosstab(hm[i], hm['BAD'], normalize='index') * 100).plot(kind='bar', figsize=(8, 4), stacked=True)
    plt.ylabel('% Bad')
    for p in mv.patches:
        plt.annotate(f'{int(p.get_height())}',
                     (p.get_x() + p.get_width() / 2., p.get_y()),
                     ha='center', va='center', xytext=(0, 5), textcoords='offset points')
    plt.show()
BAD 0 1 REASON DebtCon 81.033605 18.966395 HomeImp 77.752809 22.247191
BAD 0 1 JOB Mgr 76.662321 23.337679 Office 86.814346 13.185654 Other 76.800670 23.199330 ProfExe 83.385580 16.614420 Sales 65.137615 34.862385 Self 69.948187 30.051813
- Loans for 'Home Improvement' have a marginally higher bad rate than those for 'Debt Consolidation'.
- Job types 'Office' and 'ProfExe' have lower bad rates than the remaining job types; 'Sales' and 'Self' have the highest bad rates.
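Whether the bad-rate differences across job types are statistically significant can be checked with a chi-square test of independence. A sketch on counts reconstructed from the value counts and crosstab printed above (in the notebook this would simply be `chi2_contingency(pd.crosstab(hm['JOB'], hm['BAD']))`):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Repaid/defaulted counts per job type, reconstructed from the
# printed value_counts and bad-rate crosstab above
counts = pd.DataFrame(
    {"repaid": [588, 823, 1834, 1064, 71, 135],
     "defaulted": [179, 125, 554, 212, 38, 58]},
    index=["Mgr", "Office", "Other", "ProfExe", "Sales", "Self"],
)
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2g}")
```

A small p-value supports the visual impression that the default rate depends on job type.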
plt.figure(figsize = [8, 6])
sns.scatterplot(data=hm, x='MORTDUE', y='VALUE', hue='BAD')
plt.title('Amount due on existing Mortgage vs Property Value ')
plt.show()
There is no definite differentiation of loan outcome by the amount due on the existing mortgage and the property value; defaults occur across the full range of both variables.
plt.figure(figsize = [8, 6])
sns.scatterplot(data=hm, x='YOJ', y='CLAGE', hue='BAD')
plt.title('Years at Job vs Age of oldest Credit line ')
plt.show()
There is a higher concentration of defaults when years at the present job are below 10 and the age of the oldest trade is below 200 months.
plt.figure(figsize = [8, 6])
sns.scatterplot(data=hm, x='CLNO', y='NINQ', hue='BAD')
plt.title('# of Existing TL vs # of Inquiries ')
plt.show()
Defaults become more frequent when there are 4 or more inquiries.
There is no definite pattern with the number of existing credit lines.
The combination of 5+ inquiries and 20+ existing trades shows a higher concentration of defaults.
plt.figure(figsize = [8, 6])
sns.scatterplot(data=hm, x='DEBTINC', y='MORTDUE', hue='BAD')
plt.title('# Debt to Income vs existing mortgage due ')
plt.show()
There is a higher concentration of defaults when the debt-to-income ratio is greater than 50%, across the range of mortgage amounts due.
Multivariate Analysis¶
plt.figure(figsize=(10,10))
sns.heatmap(hm[num_cols].corr(), annot = True, vmin = -1, vmax = 1, fmt = ".2f", cmap = "Spectral")
plt.title("Correlation Heatmap")
plt.show()
The amount due on the existing mortgage and the property value are strongly correlated (0.88). Both have a correlation of ~0.3 with the age of the oldest credit line.
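Rather than reading the heatmap, strongly correlated pairs can be extracted programmatically. A small helper sketch on synthetic columns (the frame is illustrative; in the notebook it would be called as `high_corr_pairs(hm[num_cols].corr())`):

```python
import numpy as np
import pandas as pd

def high_corr_pairs(corr: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, r) for every pair with |r| >= threshold."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    return pairs

# Illustrative frame: MORTDUE is built to track VALUE, NINQ is independent
rng = np.random.default_rng(2)
value = rng.normal(100_000, 50_000, 500)
df = pd.DataFrame({
    "VALUE": value,
    "MORTDUE": 0.7 * value + rng.normal(0, 20_000, 500),
    "NINQ": rng.poisson(1.2, 500),
})
pairs = high_corr_pairs(df.corr())
print(pairs)
```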
Treating Outliers¶
# Outlier treatment: cap extreme values of DELINQ, DEROG and YOJ
hm['DELINQ_flag'] = hm['DELINQ'].apply(lambda x: 6 if x >= 6 else x)
hm['DEROG_flag'] = hm['DEROG'].apply(lambda x: 7 if x >= 7 else x)
hm['YOJ_calc'] = hm['YOJ'].apply(lambda x: 30 if x >= 30 else x)
new_col = hm.columns.tolist()
hm[new_col[13:16]].hist()
array([[<Axes: title={'center': 'DELINQ_flag'}>,
<Axes: title={'center': 'DEROG_flag'}>],
[<Axes: title={'center': 'YOJ_calc'}>, <Axes: >]], dtype=object)
# Outlier treatment: cap DEBTINC at 60
hm['DEBTINC_calc'] = hm['DEBTINC'].apply(lambda x: 60 if x >= 60 else x)
hm['DEBTINC'].hist()
plt.show()
hm['DEBTINC_calc'].hist()
plt.show()
After capping the outliers, the right skewness of the debt-to-income distribution is reduced.
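A vectorized alternative to the `apply`/lambda capping used above is `Series.clip`, which applies the same upper bound in one call (the values below are illustrative):

```python
import pandas as pd

s = pd.Series([0, 2, 5, 8, 15, 72])

# Cap at a fixed upper bound -- equivalent to
# s.apply(lambda x: 6 if x >= 6 else x)
capped = s.clip(upper=6)
print(capped.tolist())  # [0, 2, 5, 6, 6, 6]

# Or cap at a percentile, so the bound adapts to the data
capped_p95 = s.clip(upper=s.quantile(0.95))
```

`clip` also leaves NaN values untouched, matching the behavior of the lambda versions above.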
Treating Missing Values¶
num_cols_1 = hm.select_dtypes(include=np.number).columns.tolist()
print(num_cols_1)
cat_cols_1 = hm.select_dtypes(include='object').columns.tolist()
print(cat_cols_1)
['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC', 'DELINQ_flag', 'DEROG_flag', 'YOJ_calc', 'DEBTINC_calc'] ['REASON', 'JOB']
# Impute missing values with the mean for numeric variables and the mode for categorical variables
for i in num_cols_1:
    hm[i].fillna(hm[i].mean(), inplace=True)
for i in cat_cols_1:
    hm[i].fillna(hm[i].mode()[0], inplace=True)
for col in cat_cols:
    print(col)
    print("***Checking distribution prior-imputation****")
    print(data[col].value_counts(normalize=True))
    print("***Checking distribution post-imputation****")
    print(hm[col].value_counts(normalize=True))
    print('-----' * 10)
    # checking default rate before and after imputation
    print('****BEFORE imputation ****')
    print(pd.crosstab(data[col], hm['BAD'], normalize='index') * 100)
    print('****AFTER imputation ****')
    print(pd.crosstab(hm[col], hm['BAD'], normalize='index') * 100)
REASON ***Checking distribution prior-imputation**** REASON DebtCon 0.688157 HomeImp 0.311843 Name: proportion, dtype: float64 ***Checking distribution post-imputation**** REASON DebtCon 0.701342 HomeImp 0.298658 Name: proportion, dtype: float64 -------------------------------------------------- ****BEFORE imputation **** BAD 0 1 REASON DebtCon 81.033605 18.966395 HomeImp 77.752809 22.247191 ****AFTER imputation **** BAD 0 1 REASON DebtCon 81.028708 18.971292 HomeImp 77.752809 22.247191 JOB ***Checking distribution prior-imputation**** JOB Other 0.420349 ProfExe 0.224608 Office 0.166872 Mgr 0.135011 Self 0.033973 Sales 0.019187 Name: proportion, dtype: float64 ***Checking distribution post-imputation**** JOB Other 0.447483 ProfExe 0.214094 Office 0.159060 Mgr 0.128691 Self 0.032383 Sales 0.018289 Name: proportion, dtype: float64 -------------------------------------------------- ****BEFORE imputation **** BAD 0 1 JOB Mgr 76.662321 23.337679 Office 86.814346 13.185654 Other 76.800670 23.199330 ProfExe 83.385580 16.614420 Sales 65.137615 34.862385 Self 69.948187 30.051813 ****AFTER imputation **** BAD 0 1 JOB Mgr 76.662321 23.337679 Office 86.814346 13.185654 Other 78.365204 21.634796 ProfExe 83.385580 16.614420 Sales 65.137615 34.862385 Self 69.948187 30.051813 BAD ***Checking distribution prior-imputation**** BAD 0 0.800503 1 0.199497 Name: proportion, dtype: float64 ***Checking distribution post-imputation**** BAD 0 0.800503 1 0.199497 Name: proportion, dtype: float64 -------------------------------------------------- ****BEFORE imputation **** BAD 0 1 BAD 0 100.0 0.0 1 0.0 100.0 ****AFTER imputation **** BAD 0 1 BAD 0 100.0 0.0 1 0.0 100.0
Post-imputation, the share of the category used to fill missing values increases by ~3 percentage points, and the per-category default rates show de minimis change.
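A hedged alternative to the manual `fillna` loop is scikit-learn's `SimpleImputer` with `add_indicator=True`, which also keeps a "was missing" flag column. That can matter here, since DEBTINC is missing for ~21% of rows and the missingness itself may be informative. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"DEBTINC": [30.0, np.nan, 40.0, np.nan]})

# strategy="median" is more robust to the right skew seen in EDA;
# add_indicator=True appends a binary missing-indicator column
imp = SimpleImputer(strategy="median", add_indicator=True)
out = imp.fit_transform(df)
print(out)
# column 0: imputed values (median of [30, 40] is 35)
# column 1: 1 where the value was originally missing
```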
Feature Engineering
# Creating the ratio of existing mortgage amount due to property value
hm['MORTDUE_VALUE_ratio'] = hm['MORTDUE'] / hm['VALUE']
# Cap the ratio just above 1 to limit the influence of extreme values
hm['MORTDUE_VALUE_ratio_1'] = hm['MORTDUE_VALUE_ratio'].apply(lambda x: 1.01 if x > 1 else x)
hm.describe().T
hm['MORTDUE_VALUE_ratio_1'].hist()
<Axes: >
num_cols_1.append('MORTDUE_VALUE_ratio_1')
num_cols_1.remove('MORTDUE')
print(num_cols_1)
plt.figure(figsize=(10,10))
sns.heatmap(hm[num_cols_1].corr(), annot = True, vmin = -1, vmax = 1, fmt = ".2f", cmap = "Spectral")
plt.title("Correlation Heatmap")
['BAD', 'LOAN', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC', 'DELINQ_flag', 'DEROG_flag', 'YOJ_calc', 'DEBTINC_calc', 'MORTDUE_VALUE_ratio_1']
Text(0.5, 1.0, 'Correlation Heatmap')
plt.figure(figsize = [8, 6])
sns.scatterplot(data=hm, x='MORTDUE_VALUE_ratio_1', y='DEBTINC', hue='BAD')
plt.title('# Debt to Income vs Loan to Value ratio')
plt.show()
Loans with a DTI of 50+ and a high amount-due-to-value ratio have a higher concentration of defaults.
Important Insights from EDA¶
What are the most important observations and insights from the data based on the EDA performed?
Data checks:
• Dataset has 5,960 rows and 13 columns. 11 variables are numeric, and 2 variables are non-numeric.
• Except for 'BAD' and 'LOAN', all remaining variables have missing values.
-'DEBTINC' has 21% and 'DEROG' has 12% missing; the other 9 variables each have less than 10% missing.
Univariate analysis:
-All numeric variables except the age of the oldest credit line and the number of credit lines have right-skewed distributions; all numeric variables have outliers.
Bivariate analysis:
-Average values of the numeric variables were compared by the default variable ‘BAD’; loans that default have the following characteristics:
 - 12% lower mortgage amount due and 8% lower property value
 - 14% fewer years at the present job (~8 yrs vs. ~9 yrs for non-defaulters)
 - about 5 times higher average numbers of derogatory reports and delinquencies than non-defaulters
 - 20% lower age of the oldest credit line and 1.7 times the average number of inquiries
 - higher average debt-to-income ratio
-There is no significant difference in the average number of existing tradelines between defaulters and non-defaulters.
-Loans for 'Home Improvement' have a marginally higher bad rate than those for 'Debt Consolidation'.
-Job types 'Office' and 'ProfExe' have lower bad rates than the remaining job types.
Multivariate analysis:
-There is a higher concentration of defaults when the debt-to-income ratio is greater than 50%.
-There is a higher concentration of defaults when years at the present job are below 10 and the age of the oldest trade is below 200 months.
-Defaults become more frequent with 4 or more inquiries; the combination of 5+ inquiries and 20+ existing trades also shows higher defaults.
Outlier and missing value treatment:
-Due to the highly skewed distributions of ‘DELINQ’, ‘DEROG’, ‘DEBTINC’ and ‘YOJ’, outlier capping was applied to these numeric variables to limit the influence of extreme values.
-Missing values in numeric variables were imputed with their mean values; ‘DEBTINC’, the variable with the most missing values, has ~21% missing.
-For categorical variables, the missing values were imputed with the most frequently occurring category.
Model Building - Approach¶
- Data preparation
- Partition the data into train and test set
- Build the model
- Fit on the train data
- Tune the model
- Test the model on test set
Preparing data for modeling
Creating dummy variables for the categorical variables
#Create dummy variables for categorical variables
hm_tomodel=pd.get_dummies(hm,columns=cat_cols[0:2],drop_first=False)
hm_tomodel.head(5)
| BAD | LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | ... | MORTDUE_VALUE_ratio | MORTDUE_VALUE_ratio_1 | REASON_DebtCon | REASON_HomeImp | JOB_Mgr | JOB_Office | JOB_Other | JOB_ProfExe | JOB_Sales | JOB_Self | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0000 | 39025.000000 | 10.500000 | 0.00000 | 0.000000 | 94.366667 | 1.000000 | 9.000000 | ... | 0.662652 | 0.662652 | False | True | False | False | True | False | False | False |
| 1 | 1 | 1300 | 70053.0000 | 68400.000000 | 7.000000 | 0.00000 | 2.000000 | 121.833333 | 0.000000 | 14.000000 | ... | 1.024167 | 1.010000 | False | True | False | False | True | False | False | False |
| 2 | 1 | 1500 | 13500.0000 | 16700.000000 | 4.000000 | 0.00000 | 0.000000 | 149.466667 | 1.000000 | 10.000000 | ... | 0.808383 | 0.808383 | False | True | False | False | True | False | False | False |
| 3 | 1 | 1500 | 73760.8172 | 101776.048741 | 8.922268 | 0.25457 | 0.449442 | 179.766275 | 1.186055 | 21.296096 | ... | 0.724736 | 0.724736 | True | False | False | False | True | False | False | False |
| 4 | 0 | 1700 | 97800.0000 | 112000.000000 | 3.000000 | 0.00000 | 0.000000 | 93.333333 | 0.000000 | 14.000000 | ... | 0.873214 | 0.873214 | False | True | False | True | False | False | False | False |
5 rows × 25 columns
Partition the data into train and test set
#Separating the dataset into independent variables and target
X= hm_tomodel.drop(['BAD','DELINQ','DEROG','MORTDUE_VALUE_ratio','YOJ','DEBTINC','MORTDUE','LOAN'],axis=1)
y=hm_tomodel['BAD']
X.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5960 entries, 0 to 5959 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 VALUE 5960 non-null float64 1 CLAGE 5960 non-null float64 2 NINQ 5960 non-null float64 3 CLNO 5960 non-null float64 4 DELINQ_flag 5960 non-null float64 5 DEROG_flag 5960 non-null float64 6 YOJ_calc 5960 non-null float64 7 DEBTINC_calc 5960 non-null float64 8 MORTDUE_VALUE_ratio_1 5960 non-null float64 9 REASON_DebtCon 5960 non-null bool 10 REASON_HomeImp 5960 non-null bool 11 JOB_Mgr 5960 non-null bool 12 JOB_Office 5960 non-null bool 13 JOB_Other 5960 non-null bool 14 JOB_ProfExe 5960 non-null bool 15 JOB_Sales 5960 non-null bool 16 JOB_Self 5960 non-null bool dtypes: bool(8), float64(9) memory usage: 465.8 KB
Scaling the features
# Scaling the data to standardize the values with respect to the mean values for each variable
sc = StandardScaler()
X_scaled = sc.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns = X.columns)
Creating training and test dataset for modeling
X_train , X_test, y_train, y_test = train_test_split(X_scaled,y,test_size=0.3,random_state=42)
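With a ~20% default rate, the split above can drift slightly in class mix between train and test. Passing `stratify=y` to `train_test_split` keeps the proportions identical in both partitions; a sketch of the variant on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic labels with a 20% positive rate, mimicking BAD
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the 20% positive rate in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)
print(y_tr.mean(), y_te.mean())  # 0.2 0.2
```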
Function for Model Performance Comparison¶
#Function to evaluate model performance - precision, recall, accuracy, F1 score
from sklearn.metrics import confusion_matrix, classification_report, recall_score, precision_score, accuracy_score, f1_score
#Create function to measure performance
def measure_performance(X, y, clf, m_name, m_num, show_accuracy=True,
                        show_classification_report=True, show_confusion_matrix=True):
    y_pred = clf.predict(X)
    recall = recall_score(y, y_pred, average='macro')        # macro-averaged recall
    precision = precision_score(y, y_pred, average='macro')  # macro-averaged precision
    acc = accuracy_score(y, y_pred)
    f_score = f1_score(y, y_pred, average='macro')
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(acc), "\n")
    df_tmp = pd.DataFrame(
        {
            'num': m_num,
            "model_name": m_name,
            "Precision": precision,
            "Recall": recall,
            "Accuracy": acc,
            "F1_score": f_score
        },
        index=[m_num],
    )
    if show_classification_report:
        print("Classification report:")
        print(classification_report(y, y_pred))
    if show_confusion_matrix:
        cm = confusion_matrix(y, y_pred)
        print("Confusion Matrix")
        print(cm)
        # Rows of confusion_matrix are actual labels, columns are predicted
        sns.heatmap(cm, annot=True, fmt='d',
                    xticklabels=['Predicted Non-Default', 'Predicted Default'],
                    yticklabels=['Actual Non-Default', 'Actual Default'])
        plt.show()
    return df_tmp
Logistic Regression
# Logistic regression
lg = LogisticRegression(random_state = 1, max_iter=1000)
lg.fit(X_train, y_train)
LogisticRegression(max_iter=1000, random_state=1)
# Measure the performance of first Logistic Regression Model
df_perf_lg_a = measure_performance(X_train,y_train,lg,'Logistic Regression Train',1)
df_perf_lg_b = measure_performance(X_test,y_test,lg,'Logistic Regression Test',2)
print ("***********Model Performance****************")
print(df_perf_lg_a)
print(df_perf_lg_b)
Accuracy:0.843
Classification report:
precision recall f1-score support
0 0.85 0.97 0.91 3382
1 0.71 0.28 0.40 790
accuracy 0.84 4172
macro avg 0.78 0.63 0.66 4172
weighted avg 0.83 0.84 0.81 4172
Confusion Matrix
[[3293 89]
[ 568 222]]
Accuracy:0.817
Classification report:
precision recall f1-score support
0 0.83 0.96 0.89 1389
1 0.70 0.31 0.43 399
accuracy 0.82 1788
macro avg 0.77 0.64 0.66 1788
weighted avg 0.80 0.82 0.79 1788
Confusion Matrix
[[1337 52]
[ 276 123]]
***********Model Performance**************** num model_name Precision Recall Accuracy F1_score 1 1 Logistic Regression Train 0.783357 0.627348 0.842522 0.656281 num model_name Precision Recall Accuracy F1_score 2 2 Logistic Regression Test 0.765874 0.635417 0.816555 0.659655
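`roc_curve` and `auc` are imported at the top but not used anywhere below; the model's threshold-free ranking quality could be summarised as sketched here. The data is synthetic; in the notebook the scores would come from `lg.predict_proba(X_test)[:, 1]`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc

# Synthetic binary problem where feature 0 carries the signal
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # probability of class 1

# ROC is computed from scores, not hard labels, so it is
# independent of the 0.5 classification threshold
fpr, tpr, _ = roc_curve(y, scores)
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.3f}")
```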
# Coefficients and Odds Ratios
coefficients =lg.coef_[0]
odds_ratios = np.exp(coefficients)
# Display feature importance using coefficients and odds ratios
lg_feature_importance= pd.DataFrame({
'Feature': X.columns,
'Coefficient': coefficients,
'Odds Ratio': odds_ratios
})
print("\nFeature Importance (Coefficient and Odds Ratio):")
print(lg_feature_importance.sort_values(by='Coefficient', ascending=False))
Feature Importance (Coefficient and Odds Ratio):
Feature Coefficient Odds Ratio
4 DELINQ_flag 0.784337 2.190953
5 DEROG_flag 0.453479 1.573777
7 DEBTINC_calc 0.315807 1.371365
2 NINQ 0.275793 1.317575
15 JOB_Sales 0.201428 1.223148
10 REASON_HomeImp 0.083637 1.087234
13 JOB_Other 0.075669 1.078605
16 JOB_Self 0.073272 1.076023
0 VALUE 0.057312 1.058986
14 JOB_ProfExe 0.016971 1.017116
11 JOB_Mgr -0.024331 0.975962
9 REASON_DebtCon -0.083637 0.919765
8 MORTDUE_VALUE_ratio_1 -0.130489 0.877667
6 YOJ_calc -0.149173 0.861420
3 CLNO -0.165068 0.847836
12 JOB_Office -0.208894 0.811481
1 CLAGE -0.487084 0.614415
rfe_model = LogisticRegression(max_iter=10000, solver='liblinear',random_state=1,class_weight='balanced')
rfe = RFE(rfe_model, n_features_to_select=8)
rfe.fit(X_train, y_train)
rfe_features = X.columns[rfe.support_]
print("\nSelected Features by RFE:")
print('Features:',rfe_features)
print('Ranking:',rfe.ranking_)
df_perf_rfe_a = measure_performance(X_train,y_train,rfe,'Logistic Regression w. RFE Train',3)
df_perf_rfe_b = measure_performance(X_test,y_test,rfe,'Logistic Regression w. RFE Test',4)
print ("***********Model Performance****************")
print(df_perf_rfe_a)
print(df_perf_rfe_b)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_Series = pd.Series([variance_inflation_factor(X_train[rfe_features].values,i) for i in range(X_train[rfe_features].shape[1])],index=X_train[rfe_features].columns)
print(vif_Series)
Selected Features by RFE:
FEatures: Index(['CLAGE', 'NINQ', 'CLNO', 'DELINQ_flag', 'DEROG_flag', 'DEBTINC_calc',
'JOB_Office', 'JOB_Sales'],
dtype='object')
Ranking: [ 8 1 1 1 1 1 3 1 4 2 5 9 1 6 10 1 7]
Accuracy:0.756
Classification report:
precision recall f1-score support
0 0.90 0.79 0.84 3382
1 0.41 0.63 0.49 790
accuracy 0.76 4172
macro avg 0.65 0.71 0.67 4172
weighted avg 0.81 0.76 0.77 4172
Confusion Matrix
[[2656 726]
[ 293 497]]
Accuracy:0.731
Classification report:
precision recall f1-score support
0 0.87 0.77 0.82 1389
1 0.43 0.61 0.50 399
accuracy 0.73 1788
macro avg 0.65 0.69 0.66 1788
weighted avg 0.77 0.73 0.75 1788
Confusion Matrix
[[1063 326]
[ 155 244]]
***********Model Performance**************** num model_name Precision Recall Accuracy \ 3 3 Logistic Regression w. RFE Train 0.653511 0.707224 0.755753 F1_score 3 0.666418 num model_name Precision Recall Accuracy \ 4 4 Logistic Regression w RFe Test 0.650406 0.688414 0.730984 F1_score 4 0.659554 CLAGE 1.080902 NINQ 1.067755 CLNO 1.128125 DELINQ_flag 1.058822 DEROG_flag 1.072238 DEBTINC_calc 1.050084 JOB_Office 1.012306 JOB_Sales 1.011224 dtype: float64
# Added post-Milestone Submission
# Identify the importance of features in the RFE model
rfe_coefficients = rfe.estimator_.coef_[0]
rfe_odds_ratios = np.exp(rfe_coefficients)
rfe_feature_importance= pd.DataFrame({
'Feature': X_train[rfe_features].columns,
'Coefficient': rfe_coefficients,
'Odds Ratio': rfe_odds_ratios})
print("\nFeature Importance (Coefficient and Odds Ratio) for RFE selected features:")
print(rfe_feature_importance.sort_values(by='Coefficient', ascending=False))
Feature Importance (Coefficient and Odds Ratio) for RFE selected features:
Feature Coefficient Odds Ratio
3 DELINQ_flag 0.780168 2.181838
4 DEROG_flag 0.508693 1.663116
5 DEBTINC_calc 0.326577 1.386214
1 NINQ 0.292439 1.339691
7 JOB_Sales 0.195805 1.216290
6 JOB_Office -0.247626 0.780652
2 CLNO -0.250977 0.778040
0 CLAGE -0.423836 0.654531
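The odds-ratio column above is just the exponential of the coefficient column. A quick arithmetic check using two coefficient values copied from the printed table:

```python
# An odds ratio is exp(coefficient). Values copied from the feature-importance table above.
import math

coef_delinq = 0.780168   # DELINQ_flag
coef_clage = -0.423836   # CLAGE

# exp(0.780168) ≈ 2.18: a prior delinquency roughly doubles the odds of default,
# holding the other RFE features fixed.
print(round(math.exp(coef_delinq), 6))
# exp(-0.423836) ≈ 0.65: an odds ratio below 1, so a higher age of the oldest
# credit line lowers the odds of default.
print(round(math.exp(coef_clage), 6))
```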
Decision Tree¶
# Building decision tree model
# Model #2
dt = DecisionTreeClassifier(random_state = 1,max_depth=3,class_weight='balanced')
dt.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=3, random_state=1)
# Measure the performance of first Decision tree Model
df_perf_dt_a = measure_performance(X_train,y_train,dt,'Decision Tree Train',5)
df_perf_dt_b = measure_performance(X_test,y_test,dt,'Decision Tree Test',6)
print ("***********Model Performance****************")
print(df_perf_dt_a)
print(df_perf_dt_b)
Accuracy:0.864
Classification report:
precision recall f1-score support
0 0.94 0.89 0.91 3382
1 0.62 0.74 0.67 790
accuracy 0.86 4172
macro avg 0.78 0.82 0.79 4172
weighted avg 0.88 0.86 0.87 4172
Confusion Matrix
[[3016 366]
[ 203 587]]
Accuracy:0.869
Classification report:
precision recall f1-score support
0 0.93 0.90 0.91 1389
1 0.68 0.76 0.72 399
accuracy 0.87 1788
macro avg 0.81 0.83 0.82 1788
weighted avg 0.88 0.87 0.87 1788
Confusion Matrix
[[1248 141]
[ 94 305]]
***********Model Performance****************
   num           model_name  Precision    Recall  Accuracy  F1_score
5    5  Decision Tree Train   0.776443  0.817409  0.863615  0.793676
   num          model_name  Precision   Recall  Accuracy  F1_score
6    6  Decision Tree Test   0.806906  0.83145  0.868568  0.817922
# Building decision tree model using GridSearch
dt_1 = DecisionTreeClassifier(random_state = 1)
param_grid = {'max_depth' : np.arange(2,7,1),
'criterion': ['gini', 'entropy'],
'min_samples_leaf' : [3,5,7] }
#use Grid Search
gridSearch = GridSearchCV(dt_1,param_grid, scoring='recall_weighted',cv=5)
gridSearch.fit(X_train,y_train)
best_params = gridSearch.best_params_
best_recall = gridSearch.best_score_
# print the best parameters for max depth and min number of nodes
print(best_params)
print(best_recall )
{'criterion': 'gini', 'max_depth': np.int64(6), 'min_samples_leaf': 3}
0.8787119286606643
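`best_params_` and `best_score_` only show the winner; `cv_results_` holds every combination tried, which helps check whether the best score is a clear winner or within noise of its neighbours. A minimal sketch on synthetic data (not the loan data), using the standard `cv_results_` column names:

```python
# Inspecting the full grid-search table, not just the best entry.
# Synthetic demo data; in the notebook, `gridSearch.cv_results_` would be used directly.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, random_state=1)
gs = GridSearchCV(DecisionTreeClassifier(random_state=1),
                  {'max_depth': [2, 4, 6]}, scoring='recall_weighted', cv=3)
gs.fit(X_demo, y_demo)

# One row per parameter combination, best rank first
cv_table = (pd.DataFrame(gs.cv_results_)
            [['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']]
            .sort_values('rank_test_score'))
print(cv_table)
```

If the top entries differ by less than their `std_test_score`, the simpler (shallower) tree is usually the safer pick.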
dt_1 = DecisionTreeClassifier(random_state = 1, max_depth =best_params['max_depth'], min_samples_leaf = best_params['min_samples_leaf'],criterion=best_params['criterion'])
dt_1.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=np.int64(6), min_samples_leaf=3, random_state=1)
# Measure the performance of the Decision Tree model using the best parameters from Grid Search
df_perf_dt1_a = measure_performance(X_train,y_train,dt_1,'DecisionTree_train_GV',7)
df_perf_dt1_b = measure_performance(X_test,y_test,dt_1,'DecisionTree_test_GV',8)
print ("************Model Performance****************")
print(df_perf_dt1_a)
print(df_perf_dt1_b)
Accuracy:0.895
Classification report:
precision recall f1-score support
0 0.93 0.94 0.94 3382
1 0.73 0.71 0.72 790
accuracy 0.90 4172
macro avg 0.83 0.82 0.83 4172
weighted avg 0.89 0.90 0.89 4172
Confusion Matrix
[[3176 206]
[ 232 558]]
Accuracy:0.878
Classification report:
precision recall f1-score support
0 0.91 0.94 0.92 1389
1 0.76 0.66 0.71 399
accuracy 0.88 1788
macro avg 0.83 0.80 0.81 1788
weighted avg 0.87 0.88 0.87 1788
Confusion Matrix
[[1307 82]
[ 137 262]]
************Model Performance****************
   num             model_name  Precision    Recall  Accuracy  F1_score
7    7  DecisionTree_train_GV   0.831146  0.822709  0.895014   0.82682
   num            model_name  Precision    Recall  Accuracy  F1_score
8    8   DecisionTree_test_GV   0.833376  0.798803  0.877517  0.813973
# Combined performance from the logistic regression and decision tree models
df_perf_combd_1 = pd.merge(df_perf_lg_a,df_perf_lg_b,how='outer')
df_perf_combd_1a = pd.merge(df_perf_rfe_a ,df_perf_rfe_b,how='outer')
df_perf_combd_2 = pd.merge(df_perf_dt_a ,df_perf_dt_b,how='outer')
df_perf_combd_3 = pd.merge(df_perf_dt1_a ,df_perf_dt1_b,how='outer')
df_perf_combd_t= pd.merge(df_perf_combd_1,df_perf_combd_1a,how='outer')
df_perf_combd_t1= pd.merge(df_perf_combd_t,df_perf_combd_2,how='outer')
df_perf_combd_all = pd.merge(df_perf_combd_t1,df_perf_combd_3,how='outer')
df_perf_combd_all
| num | model_name | Precision | Recall | Accuracy | F1_score | |
|---|---|---|---|---|---|---|
| 0 | 1 | Logistic Regression Train | 0.783357 | 0.627348 | 0.842522 | 0.656281 |
| 1 | 2 | Logistic Regression Test | 0.765874 | 0.635417 | 0.816555 | 0.659655 |
| 2 | 3 | Logistic Regression w. RFE Train | 0.653511 | 0.707224 | 0.755753 | 0.666418 |
| 3 | 4 | Logistic Regression w RFE Test | 0.650406 | 0.688414 | 0.730984 | 0.659554 |
| 4 | 5 | Decision Tree Train | 0.776443 | 0.817409 | 0.863615 | 0.793676 |
| 5 | 6 | Decision Tree Test | 0.806906 | 0.831450 | 0.868568 | 0.817922 |
| 6 | 7 | DecisionTree_train_GV | 0.831146 | 0.822709 | 0.895014 | 0.826820 |
| 7 | 8 | DecisionTree_test_GV | 0.833376 | 0.798803 | 0.877517 | 0.813973 |
# Logistic Regression and Decision Tree models - feature importance comparison
columns = X.columns
importance_dt_md3 = dt.feature_importances_
importance_dt_gv = dt_1.feature_importances_
lg_feature_importance_1 = lg_feature_importance.sort_values(by='Odds Ratio', ascending=False)
rfe_feature_importance_1 = rfe_feature_importance.sort_values(by='Odds Ratio', ascending=False)
importance_df_md_3 = pd.DataFrame(importance_dt_md3, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
importance_df_gv = pd.DataFrame(importance_dt_gv, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
fig, axes = plt.subplots(2, 2, figsize=(15,10)) # Adjust figsize for better spacing
sns.barplot(x=lg_feature_importance_1['Odds Ratio'], y=lg_feature_importance_1.Feature, ax=axes[0,0])
axes[0,0].set_title('**Feature Importance - Logistic Regression (Odds Ratio)**')
sns.barplot(x=rfe_feature_importance_1['Odds Ratio'] ,y=rfe_feature_importance_1.Feature, ax=axes[0,1])
axes[0,1].set_title('**Feature Importance - Logistic Regression w/ Recursive Feature Elimination (Odds Ratio)**')
sns.barplot(x= importance_df_md_3.Importance, y=importance_df_md_3.index,ax=axes[1,0])
axes[1,0].set_title('**Feature Importance - Decision Tree Model - Max depth 3**')
sns.barplot(x=importance_df_gv.Importance,y=importance_df_gv.index,ax=axes[1,1])
axes[1,1].set_title('**Feature Importance - Decision Tree Model - Grid Search Max Depth =6 , Min sample leaf = 3**')
plt.tight_layout()
plt.show()
Building a Random Forest Classifier¶
Random Forest is a bagging algorithm in which the base models are decision trees. Bootstrap samples are drawn from the training data, and a decision tree is fit on each sample.
The predictions from all the decision trees are then combined, and the final prediction is made by majority voting (classification) or averaging (regression).
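The voting described above can be observed directly: a fitted `RandomForestClassifier` exposes its trees via `estimators_`, and for classification its prediction matches the majority vote of the individual trees (strictly speaking, scikit-learn averages the trees' predicted probabilities, which almost always agrees with the hard-label vote). A small sketch on synthetic data, not the loan data:

```python
# Reconstructing a random forest's prediction from its individual trees' votes.
# Synthetic demo data only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=300, random_state=1)
rf = RandomForestClassifier(n_estimators=25, random_state=1).fit(X_demo, y_demo)

# One row of hard predictions per tree: shape (25, 300)
tree_votes = np.stack([t.predict(X_demo) for t in rf.estimators_])
majority = (tree_votes.mean(axis=0) >= 0.5).astype(int)

agreement = (majority == rf.predict(X_demo)).mean()
print(f"Majority vote agrees with rf.predict on {agreement:.0%} of samples")
```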
# Base RandomForestClassifier
rf_estimator = RandomForestClassifier( random_state = 1,class_weight = 'balanced')
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(class_weight='balanced', random_state=1)
# Model Performance evaluation - Base random forest classifier
df_perf_rf1_a = measure_performance(X_train,y_train,rf_estimator,'randomForest_train',9)
df_perf_rf1_b = measure_performance(X_test,y_test,rf_estimator,'randomForest_test',10)
print ("************Model Performance****************")
print(df_perf_rf1_a)
print(df_perf_rf1_b)
Accuracy:1.000
Classification report:
precision recall f1-score support
0 1.00 1.00 1.00 3382
1 1.00 1.00 1.00 790
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Confusion Matrix
[[3382 0]
[ 0 790]]
Accuracy:0.897
Classification report:
precision recall f1-score support
0 0.90 0.98 0.94 1389
1 0.89 0.62 0.73 399
accuracy 0.90 1788
macro avg 0.89 0.80 0.83 1788
weighted avg 0.90 0.90 0.89 1788
Confusion Matrix
[[1358 31]
[ 153 246]]
************Model Performance****************
num model_name Precision Recall Accuracy F1_score
9 9 randomForest_train 1.0 1.0 1.0 1.0
num model_name Precision Recall Accuracy F1_score
10 10 randomForest_test 0.893415 0.797112 0.897092 0.832181
Random Forest Classifier Hyperparameter Tuning¶
# Random Forest classifer initialization for hyperparameter tuning
rf_estimator_tuned = RandomForestClassifier(random_state = 1)
# Grid of parameters to choose from
params_rf = {
"n_estimators": [5,7,9],
"min_samples_leaf": np.arange(2,7,1),
"max_features": [0.2, 0.5, 0.7, 'sqrt'],  # 'auto' was removed in scikit-learn 1.3; 'sqrt' is its equivalent for classifiers
}
import sklearn.metrics as metrics
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, params_rf, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
print(grid_obj.best_params_)
{'max_features': 0.7, 'min_samples_leaf': np.int64(4), 'n_estimators': 7}
# Random Tree classifer with best parameters identified through GridSearch
rf_estimator_tuned = RandomForestClassifier(random_state = 1,n_estimators=grid_obj.best_params_['n_estimators'],min_samples_leaf=grid_obj.best_params_['min_samples_leaf'],max_features=grid_obj.best_params_['max_features'])
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(max_features=0.7, min_samples_leaf=np.int64(4),
                       n_estimators=7, random_state=1)
# Measure the performance of the tuned Random Forest model
df_perf_rf2_a = measure_performance(X_train,y_train,rf_estimator_tuned,'randomForest_train_w_GV',11)
df_perf_rf2_b = measure_performance(X_test,y_test,rf_estimator_tuned,'randomForest_test_GV',12)
print ("************Model Performance****************")
print(df_perf_rf2_a)
print(df_perf_rf2_b)
Accuracy:0.951
Classification report:
precision recall f1-score support
0 0.95 0.99 0.97 3382
1 0.93 0.79 0.86 790
accuracy 0.95 4172
macro avg 0.94 0.89 0.91 4172
weighted avg 0.95 0.95 0.95 4172
Confusion Matrix
[[3338 44]
[ 162 628]]
Accuracy:0.883
Classification report:
precision recall f1-score support
0 0.90 0.96 0.93 1389
1 0.81 0.62 0.70 399
accuracy 0.88 1788
macro avg 0.86 0.79 0.81 1788
weighted avg 0.88 0.88 0.88 1788
Confusion Matrix
[[1332 57]
[ 152 247]]
************Model Performance****************
num model_name Precision Recall Accuracy F1_score
11 11 randomForest_train_w_GV 0.944119 0.890963 0.950623 0.914582
num model_name Precision Recall Accuracy F1_score
12 12 randomForest_test_GV 0.855037 0.789005 0.88311 0.814978
# combined performance from Random Forest classifier results
df_perf_rf_1 = pd.merge(df_perf_rf1_a,df_perf_rf1_b,how='outer')
df_perf_rf_2 = pd.merge(df_perf_rf2_a,df_perf_rf2_b,how='outer')
df_perf_combd_rf_all = pd.merge(df_perf_rf_1,df_perf_rf_2,how='outer')
# combining Decision tree and Random forest classifier results
df_perf_combd_dt_rf = pd.merge(df_perf_combd_all,df_perf_combd_rf_all,how='outer')
df_perf_combd_dt_rf
| num | model_name | Precision | Recall | Accuracy | F1_score | |
|---|---|---|---|---|---|---|
| 0 | 1 | Logistic Regression Train | 0.783357 | 0.627348 | 0.842522 | 0.656281 |
| 1 | 2 | Logistic Regression Test | 0.765874 | 0.635417 | 0.816555 | 0.659655 |
| 2 | 3 | Logistic Regression w. RFE Train | 0.653511 | 0.707224 | 0.755753 | 0.666418 |
| 3 | 4 | Logistic Regression w RFE Test | 0.650406 | 0.688414 | 0.730984 | 0.659554 |
| 4 | 5 | Decision Tree Train | 0.776443 | 0.817409 | 0.863615 | 0.793676 |
| 5 | 6 | Decision Tree Test | 0.806906 | 0.831450 | 0.868568 | 0.817922 |
| 6 | 7 | DecisionTree_train_GV | 0.831146 | 0.822709 | 0.895014 | 0.826820 |
| 7 | 8 | DecisionTree_test_GV | 0.833376 | 0.798803 | 0.877517 | 0.813973 |
| 8 | 9 | randomForest_train | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 9 | 10 | randomForest_test | 0.893415 | 0.797112 | 0.897092 | 0.832181 |
| 10 | 11 | randomForest_train_w_GV | 0.944119 | 0.890963 | 0.950623 | 0.914582 |
| 11 | 12 | randomForest_test_GV | 0.855037 | 0.789005 | 0.883110 | 0.814978 |
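One way to quantify the overfitting visible in the table above is to pair each model's train and test rows and look at the accuracy gap. The sketch below hard-codes the accuracy values from the combined table (model-family labels are shortened for readability):

```python
# Train-test accuracy gap per model family, using values from the table above.
import pandas as pd

gaps = pd.DataFrame({
    'model': ['LogReg', 'LogReg+RFE', 'DecisionTree', 'DecisionTree+GS',
              'RandomForest', 'RandomForest+GS'],
    'train_acc': [0.842522, 0.755753, 0.863615, 0.895014, 1.000000, 0.950623],
    'test_acc':  [0.816555, 0.730984, 0.868568, 0.877517, 0.897092, 0.883110],
})
gaps['gap'] = gaps['train_acc'] - gaps['test_acc']
print(gaps.sort_values('gap', ascending=False))
```

The base Random Forest shows by far the largest gap (~0.10), confirming the overfitting, while the max-depth-3 Decision Tree actually scores slightly higher on test than on train.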
# Random Forest Classifier - Feature Importance
importances_rf = rf_estimator.feature_importances_
importance_df_rf = pd.DataFrame(importances_rf, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
# Random Forest Classifier with Tuning - Feature Importance
importances_rf2 = rf_estimator_tuned.feature_importances_
importance_df_rf2 = pd.DataFrame(importances_rf2, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
# Importance Features from each of Decision Tree and Random Forest Classifier
fig, axes = plt.subplots(3, 2, figsize=(15, 15)) # Adjust figsize for better spacing
sns.barplot(x=lg_feature_importance_1['Odds Ratio'], y=lg_feature_importance_1.Feature, ax=axes[0,0])
axes[0,0].set_title('**Feature Importance - Logistic Regression (Odds Ratio)**')
sns.barplot(x=rfe_feature_importance_1['Odds Ratio'] ,y=rfe_feature_importance_1.Feature, ax=axes[0,1])
axes[0,1].set_title('**Feature Importance - Logistic Regression w/ Recursive Feature Elimination (Odds Ratio)**')
axes[0,1].set_ylabel('Features')
sns.barplot(x= importance_df_md_3.Importance, y=importance_df_md_3.index,ax=axes[1,0])
axes[1,0].set_title('**Feature Importance - Decision Tree Classifier - Max depth =3**')
sns.barplot(x=importance_df_gv.Importance,y=importance_df_gv.index,ax=axes[1,1])
axes[1,1].set_title('**Feature Importance -Decision Tree Classifier - Grid Search Max Depth =6 , Min sample leaf=3**')
sns.barplot(x= importance_df_rf.Importance, y=importance_df_rf.index,ax=axes[2,0])
axes[2,0].set_title('**Feature Importance -Random Forest Classifier**')
sns.barplot(x= importance_df_rf2.Importance, y=importance_df_rf2.index,ax=axes[2,1]);
axes[2,1].set_title('**Feature Importance - Random Forest Model - Grid Search**')
plt.tight_layout()
plt.show()
Insights¶
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?
Three types of classification models were trained and validated: Logistic Regression, a Decision Tree classifier, and a Random Forest classifier. (Precision and recall below are macro-averaged across classes, per the performance tables.)
- The Logistic Regression model has a training accuracy of ~84% but a lower recall of ~63%. On test data, accuracy dropped to ~82% while recall stayed around 64%.
- Logistic Regression with Recursive Feature Elimination has ~76% accuracy and ~71% recall on training data, and validates well on the test dataset (~73% accuracy, ~69% recall).
- The base Decision Tree classifier with max depth of 3 has ~86% accuracy and ~82% recall on training data, and holds up on test data (~87% accuracy, ~83% recall).
- The Decision Tree classifier tuned with Grid Search and cross-validation improved training accuracy to ~90% with recall of ~82%. On test data, accuracy dropped slightly to ~88% and recall to ~80%.
- The base Random Forest classifier shows significant overfitting (perfect training scores); even after hyper-parameter tuning, training accuracy of ~95% against ~88% on test still indicates overfitting. The tuned Random Forest captures defaults well but overfits the training data, whereas Logistic Regression with Recursive Feature Elimination has more modest accuracy and recall but validates well on the test dataset.
2. Refined insights:
- What are the most meaningful insights relevant to the problem?
The Logistic Regression technique for classifying defaulting vs. non-defaulting loans shows good overall performance, with accuracy of ~75% and recall of ~70%, and performance on test data is comparable to the training dataset. Loans with the following attributes have a higher likelihood of default:
- Presence of delinquency
- Presence of prior derogatory behavior
- Higher debt-to-income ratio
- Higher number of recent credit inquiries
- Lower age of the oldest trade line
- Lower number of existing trade lines
- Job type of Sales
3. Proposal for the final solution design:
- What model do you propose to be adopted? Why is this the best solution to adopt?
Given the lower accuracy and recall of the logistic regression models and the overfitting of the tree-based models, I propose to:
- Adjust the hyperparameters of the decision tree and random forest models further, and refine the logistic regression model.
- Evaluate additional model performance metrics, and assess how effectively the model provides transparency for regulatory purposes and how it can improve underwriting efficiency.
#! jupyter nbconvert --to html '/content/drive/My Drive/GL_MIT_ADSP/Week9_Capstone/ChetanaShah_Capstone_Project_Notebook_Loan_Default_Prediction_Full_Code.ipynb'
For Final Submission¶
1) Executive Summary¶
What are the important findings from the analysis done in Milestone?
- The analysis is based on existing loans data with 5,690 observations and 13 variables.
- The bad rate for the dataset is 19.9%
- The data has 11 variables with missing values. All the numeric variables showed right-skewed distributions and the presence of outliers.
- For numeric variables, the mean was used to impute missing values; for categorical variables, the mode was used.
- For highly skewed variables like
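The imputation strategy in the bullets above (mean for numeric, mode for categorical) can be sketched as follows, on a toy frame rather than the actual loan data:

```python
# Mean imputation for numeric columns, mode imputation for categorical columns.
# Toy data; column names echo the loan dataset but are illustrative only.
import numpy as np
import pandas as pd

toy = pd.DataFrame({'LOAN': [1000.0, np.nan, 3000.0],
                    'JOB': ['Office', None, 'Office']})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(toy[col].mean())      # numeric -> mean
    else:
        toy[col] = toy[col].fillna(toy[col].mode()[0])   # categorical -> mode

print(toy)  # missing LOAN -> 2000.0, missing JOB -> 'Office'
```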
Describe the final proposed model specifications
- Logistic Regression, Decision Tree and Random Forest techniques were used to solve the classification problem.
- The Logistic Regression model had lower accuracy at ~74% and recall at ~71%.
- The Decision Tree and Random Forest models had higher accuracy but were overfitting the training data.
- The Decision Tree with hyperparameter tuning improved accuracy but still showed some overfitting.
- Given the need for interpretability of the model, a Decision Tree-based model is proposed as the final solution.
Brief summary of Data exploration
# Missing Rate
tmp = pd.DataFrame(data.isna().sum()/data.shape[0]*100, columns=['Missing_Rate'])
tmp.index.name = 'Feature'
# Keep only features that actually have missing values, sorted for plotting
tmp = tmp[tmp['Missing_Rate'] > 0].sort_values(by='Missing_Rate', ascending=False).reset_index()
plt.figure(figsize=(7,5))
mv2 = sns.barplot(data=tmp, x='Feature', y='Missing_Rate')
for p in mv2.patches:
mv2.annotate(f'{int(p.get_height())}',
(p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 5), textcoords='offset points')
plt.xticks(rotation=90)
plt.title('Missing Rate by Feature (%)')
plt.show()
# Loan counts and bad rate by categorical features
fig, axes = plt.subplots(2, 2, figsize=(12,10)) # Adjust figsize for better spacing
sns.barplot(data=hm['JOB'].value_counts(normalize=True),ax=axes[0,0])
axes[0,0].set_title('**Job Type**')
(pd.crosstab(hm['JOB'], hm['BAD'], normalize = 'index')*100).plot(kind = 'bar', stacked = True,ax=axes[0,1],rot=0)
axes[0,1].set_title('**Job Type vs Bad Rate**')
sns.barplot(data=hm['REASON'].value_counts(normalize=True),ax=axes[1,0])
axes[1,0].set_title('**Loan Reason**')
(pd.crosstab(hm['REASON'], hm['BAD'], normalize = 'index')*100).plot(kind = 'bar', stacked = True,ax=axes[1,1],rot=0)
axes[1,1].set_title('**REASON vs Bad Rate**')
plt.show()
Final Proposed Model Solution
# Feature Importance for Decision Tree Model with Hyper parameter tuning
sns.barplot(x=importance_df_gv.Importance,y=importance_df_gv.index)
plt.title('**Feature Importance -Decision Tree Classifier - Grid Search Max Depth =6 , Min sample leaf=3**')
plt.figure(figsize=(20,10))
plot_tree(dt_1,feature_names=X_train.columns,filled=True,fontsize=12)
plt.tight_layout()
plt.show()
# ROC comparison
# Logistic Regression w/ RFE model
fpr1, tpr1, _ = roc_curve(y_test, rfe.predict(X_test))
roc_auc1 = auc(fpr1, tpr1)
# Decision Tree Max depth 3
fpr2, tpr2, _ = roc_curve(y_test, dt.predict(X_test))
roc_auc2 = auc(fpr2, tpr2)
# Decision Tree w/ Grid Search max depth and min samples leaf
fpr3, tpr3, _ = roc_curve(y_test, dt_1.predict(X_test))
roc_auc3 = auc(fpr3, tpr3)
# Random Forest w/ Grid Search
fpr4, tpr4, _ = roc_curve(y_test,rf_estimator_tuned.predict(X_test))
roc_auc4 = auc(fpr4, tpr4)
plt.figure(figsize=(7,7))
plt.plot(fpr1, tpr1, color='orange', lw=2, label='Logistic Regression ROC (area = %0.2f)' % roc_auc1)
plt.plot(fpr2, tpr2, color='lightblue', lw=2, label='Decision Tree Max depth 3 ROC (area = %0.2f)' % roc_auc2)
plt.plot(fpr3, tpr3, color='violet', lw=2, label='Decision Tree Grid Search (area = %0.2f)' % roc_auc3)
plt.plot(fpr4, tpr4, color='navy', lw=2, label='Random Forest (area = %0.2f)' % roc_auc4)
plt.plot([0, 1], [0, 1], color='black', lw=2, linestyle='--') # Random classifier line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Comparison')
plt.legend(loc="lower right")
plt.show()
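One caveat about the curves above: they are built from hard `predict()` labels, so each "curve" is really a single operating point joined to the two corners. Passing `predict_proba` scores to `roc_curve` sweeps all thresholds and yields the full curve, usually with a higher AUC. A sketch on synthetic data; in the notebook, `dt_1.predict_proba(X_test)[:, 1]` and the other fitted models would be substituted:

```python
# ROC from hard labels vs. ROC from predicted probabilities. Synthetic demo data.
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=600, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=1)
clf = DecisionTreeClassifier(max_depth=4, random_state=1).fit(Xtr, ytr)

# Hard labels: only one interior point on the ROC curve
fpr_h, tpr_h, _ = roc_curve(yte, clf.predict(Xte))
# Probability scores: the full threshold sweep
fpr_p, tpr_p, _ = roc_curve(yte, clf.predict_proba(Xte)[:, 1])

print(f"ROC points from labels: {len(fpr_h)}, from probabilities: {len(fpr_p)}")
print(f"AUC from labels: {auc(fpr_h, tpr_h):.3f}")
print(f"AUC from probabilities: {auc(fpr_p, tpr_p):.3f}")
```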
Final Model Solution¶
# Refining the Decision Tree model with Grid Search to evaluate combinations of max depth and max leaf nodes, fixing criterion='gini' and min_samples_leaf=3 to improve accuracy and reduce overfitting
dt_3 = DecisionTreeClassifier(random_state = 1)
param_grid_2 = {'max_depth' : np.arange(2,10,2),
'criterion': ['gini'],
'max_leaf_nodes' :np.arange(3,21,2),
'min_samples_leaf':[3]}
#use Grid Search
gridSearch_2 = GridSearchCV(dt_3,param_grid_2, scoring='recall_weighted',cv=5)
gridSearch_2.fit(X_train,y_train)
best_params_2 = gridSearch_2.best_params_
best_recall_2 = gridSearch_2.best_score_
# print the best parameters for max depth and min number of nodes
print(best_params_2)
print(best_recall_2 )
{'criterion': 'gini', 'max_depth': np.int64(6), 'max_leaf_nodes': np.int64(19), 'min_samples_leaf': 3}
0.88254972070248
# Decision Tree model based on best parameters identified by Grid Search
dt_3 = DecisionTreeClassifier(random_state = 1, max_depth =best_params_2['max_depth'], max_leaf_nodes = best_params_2['max_leaf_nodes'],min_samples_leaf= best_params_2['min_samples_leaf'],criterion=best_params_2['criterion'])
dt_3.fit(X_train, y_train)
df_perf_dt3_a = measure_performance(X_train,y_train,dt_3 ,'Decision Tree GV Refined',13)
df_perf_dt3_b = measure_performance(X_test,y_test,dt_3 ,'Decision Tree GV Refined',14)
print ("************Model Performance****************")
print(df_perf_dt3_a)
print(df_perf_dt3_b)
Accuracy:0.892
Classification report:
precision recall f1-score support
0 0.93 0.93 0.93 3382
1 0.71 0.72 0.72 790
accuracy 0.89 4172
macro avg 0.82 0.83 0.83 4172
weighted avg 0.89 0.89 0.89 4172
Confusion Matrix
[[3155 227]
[ 222 568]]
Accuracy:0.883
Classification report:
precision recall f1-score support
0 0.91 0.94 0.93 1389
1 0.76 0.68 0.72 399
accuracy 0.88 1788
macro avg 0.84 0.81 0.82 1788
weighted avg 0.88 0.88 0.88 1788
Confusion Matrix
[[1305 84]
[ 126 273]]
************Model Performance****************
num model_name Precision Recall Accuracy F1_score
13 13 Decision Tree GV Refined 0.824363 0.825934 0.892378 0.825145
num model_name Precision Recall Accuracy F1_score
14 14 Decision Tree GV Refined 0.838328 0.811868 0.88255 0.823877
# combined performance from Random Forest classifier results
df_perf_dt3_combd = pd.merge(df_perf_dt3_a,df_perf_dt3_b,how='outer')
df_perf_combd_all = pd.merge(df_perf_combd_dt_rf ,df_perf_dt3_combd,how='outer')
df_perf_combd_all
| num | model_name | Precision | Recall | Accuracy | F1_score | |
|---|---|---|---|---|---|---|
| 0 | 1 | Logistic Regression Train | 0.783357 | 0.627348 | 0.842522 | 0.656281 |
| 1 | 2 | Logistic Regression Test | 0.765874 | 0.635417 | 0.816555 | 0.659655 |
| 2 | 3 | Logistic Regression w. RFE Train | 0.653511 | 0.707224 | 0.755753 | 0.666418 |
| 3 | 4 | Logistic Regression w RFE Test | 0.650406 | 0.688414 | 0.730984 | 0.659554 |
| 4 | 5 | Decision Tree Train | 0.776443 | 0.817409 | 0.863615 | 0.793676 |
| 5 | 6 | Decision Tree Test | 0.806906 | 0.831450 | 0.868568 | 0.817922 |
| 6 | 7 | DecisionTree_train_GV | 0.831146 | 0.822709 | 0.895014 | 0.826820 |
| 7 | 8 | DecisionTree_test_GV | 0.833376 | 0.798803 | 0.877517 | 0.813973 |
| 8 | 9 | randomForest_train | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 9 | 10 | randomForest_test | 0.893415 | 0.797112 | 0.897092 | 0.832181 |
| 10 | 11 | randomForest_train_w_GV | 0.944119 | 0.890963 | 0.950623 | 0.914582 |
| 11 | 12 | randomForest_test_GV | 0.855037 | 0.789005 | 0.883110 | 0.814978 |
| 12 | 13 | Decision Tree GV Refined | 0.824363 | 0.825934 | 0.892378 | 0.825145 |
| 13 | 14 | Decision Tree GV Refined | 0.838328 | 0.811868 | 0.882550 | 0.823877 |
ROC Curve¶
# Logistic Regression w/ RFE model
fpr1, tpr1, _ = roc_curve(y_test, rfe.predict(X_test))
roc_auc1 = auc(fpr1, tpr1)
# Decision Tree Max depth 3
fpr2, tpr2, _ = roc_curve(y_test, dt.predict(X_test))
roc_auc2 = auc(fpr2, tpr2)
# Decision Tree w/ Grid Search max depth and min samples leaf
fpr3, tpr3, _ = roc_curve(y_test, dt_1.predict(X_test))
roc_auc3 = auc(fpr3, tpr3)
#Random Forest with grid search
fpr4, tpr4, _ = roc_curve(y_test,rf_estimator_tuned.predict(X_test))
roc_auc4 = auc(fpr4, tpr4)
# Decision Tree w/ Max Depth and Max leaf nodes
fpr5, tpr5, _ = roc_curve(y_test, dt_3.predict(X_test))
roc_auc5 = auc(fpr5, tpr5)
plt.figure(figsize=(7,7))
plt.plot(fpr1, tpr1, color='orange', lw=2, label='Logistic Regression ROC (area = %0.2f)' % roc_auc1)
plt.plot(fpr2, tpr2, color='lightblue', lw=2, label='Decision Tree Max depth 3 ROC (area = %0.2f)' % roc_auc2)
plt.plot(fpr3, tpr3, color='violet', lw=2, label='Decision Tree Grid Search (area = %0.2f)' % roc_auc3)
plt.plot(fpr4, tpr4, color='navy', lw=2, label='Random Forest Grid Search (area = %0.2f)' % roc_auc4)
plt.plot(fpr5, tpr5, color='red', lw=2, label='Decision Tree Grid Search Refined (area = %0.2f)' % roc_auc5)
plt.plot([0, 1], [0, 1], color='black', lw=2, linestyle='--') # Random classifier line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Comparison w/ Final Model Solution')
plt.legend(loc="lower right")
plt.show()
# Plot the Decision Tree from Grid Search with max depth of 6, 19 max leaf nodes and at least 3 samples per leaf
plt.figure(figsize=(25,10))
plot_tree(dt_3 ,feature_names=X_train.columns,filled=True,fontsize=12)
plt.tight_layout()
plt.show()
# Feature importance plot
importance_dt3_gv =dt_3.feature_importances_
importance_df_dt3_gv = pd.DataFrame(importance_dt3_gv, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
sns.barplot(x=importance_df_dt3_gv.Importance,y=importance_df_dt3_gv.index)
<Axes: xlabel='Importance', ylabel='None'>
Model Interpretability using SHAP values
import shap
shap.initjs() # Initialize JavaScript for interactive plots
# Fits the explainer
explainer = shap.Explainer(dt_3.predict, X_test)
# Calculates the SHAP values - It takes some time
shap_values = explainer(X_test)
# Feature impact
shap.summary_plot(shap_values)
# Feature impact - on a particular loan
shap.plots.bar(shap_values[0])
2) Problem and Solution Summary¶
Provide the summary of the problem
- The bank was looking to simplify its applicant approval process and wanted to build a classification model that met the following requirements:
- based on its existing loan data
- developed using sound statistical analysis
- interpretable, and able to provide justification for adverse actions (rejections)
State the reason for the proposed solution design
Different classification techniques were evaluated and model performance was compared.
The final solution was proposed after evaluating the soundness and stability of the models.
A decision-tree-based model was selected because it is easy to interpret, had strong performance (accuracy of ~88% and recall of ~81%), and showed comparable performance on test data.
How it would affect the problem /business?
- On implementing the model, the bank can introduce automated decision making into its application approval process, standardize it, and reduce errors due to human judgement.
- With the model in the approval process, applicants would enter their data to receive a score or decision. This will improve the data the bank collects on borrowers and also help reduce defaults.
- The bank can focus its time on verifying applications that are eligible for approval based on the model, rather than on all applications.
Overall, use of the model will enable the bank to improve its decision-making process and, if used appropriately, reduce losses.
3) Recommendations for implementation¶
What are some key recommendations to implement the solution?
The stakeholders need to define a threshold or cut-off on the new model's score for approving or declining customers.
The underwriting team will need to be trained on how to use the model in the approval process and how to respond to questions about model-based declines.
The model could be introduced as a challenger to the current process to evaluate its effectiveness in use and to check whether any adjustments are needed before fully adopting it for decision making.
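The cut-off recommendation above can be prototyped by sweeping the model's predicted default probability and tabulating the approval-relevant trade-off at each threshold. A sketch on synthetic data; in the notebook, `dt_3.predict_proba(X_test)[:, 1]` would supply the scores:

```python
# Trade-off table for candidate decline cut-offs on a default-probability score.
# Synthetic demo data; the cut-off values are illustrative, not recommendations.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=600, weights=[0.8], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=1)
clf = DecisionTreeClassifier(max_depth=4, random_state=1).fit(Xtr, ytr)
scores = clf.predict_proba(Xte)[:, 1]

rows = []
for cutoff in [0.3, 0.5, 0.7]:
    flagged = (scores >= cutoff).astype(int)  # 1 = predicted defaulter -> decline
    rows.append({'cutoff': cutoff,
                 'decline_rate': flagged.mean(),
                 'precision': precision_score(yte, flagged, zero_division=0),
                 'recall': recall_score(yte, flagged, zero_division=0)})
print(pd.DataFrame(rows))
```

Lower cut-offs catch more defaulters (higher recall) but decline more good applicants; the business chooses the point that balances expected losses against approval volume.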
What are the key actionables for stakeholders?
The stakeholders need to monitor the effectiveness of the cut-off and may need to re-evaluate it on a regular basis.
The stakeholders need to introduce model monitoring to analyze population stability.
- The model is developed on existing loans; if the bank's customer base changes, or better data collection reveals a shift in the population relative to the development data, the model will need to be rebuilt.
What is the expected benefit and/or costs (List the benefits of the solution. State some rational assumptions to put forward some numbers on costs/benefits for stakeholders)?
What are the potential risks or challenges of the proposed solution design?
The application data required by the model needs to be gathered, and derived inputs must be calculated correctly before being used. If there are errors or missing data, model decisions will be inaccurate.
Models that are incorrectly implemented, or modified without stakeholder awareness (either prior to testing/validation or by members of the implementation team), can lead to unintended consequences in the application process.
What further analysis needs to be done or what other associated problems need to be solved?
As performance data on applications approved using the new model becomes available, the effectiveness of the current model needs to be monitored and the model rebuilt if needed.
The effectiveness of the model on rejected applications also needs to be evaluated. The stakeholders should look at validating model performance on all applications rather than only on existing customers.
! jupyter nbconvert --to html '/content/drive/My Drive/GL_MIT_ADSP/Week9_Capstone/ChetanaShah_Capstone_Project_Notebook_Loan_Default_Prediction_Full_Code.ipynb'